Optimistic Exploration even with a Pessimistic Initialisation
Optimistic initialisation is an effective strategy for efficient exploration
in reinforcement learning (RL). In the tabular case, all provably efficient
model-free algorithms rely on it. However, model-free deep RL algorithms do not
use optimistic initialisation despite taking inspiration from these provably
efficient tabular algorithms. In particular, in scenarios with only positive
rewards, Q-values are initialised at their lowest possible values due to
commonly used network initialisation schemes, a pessimistic initialisation.
Merely initialising the network to output optimistic Q-values is not enough,
since we cannot ensure that they remain optimistic for novel state-action
pairs, which is crucial for exploration. We propose a simple count-based
augmentation to pessimistically initialised Q-values that separates the source
of optimism from the neural network. We show that this scheme is provably
efficient in the tabular setting and extend it to the deep RL setting. Our
algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments
the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure
optimism during both action selection and bootstrapping. We show that OPIQ
outperforms non-optimistic DQN variants that utilise a pseudocount-based
intrinsic motivation in hard exploration tasks, and that it predicts optimistic
estimates for novel state-action pairs.
Comment: Published as a conference paper at ICLR 2020.
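The count-based augmentation at the core of OPIQ can be illustrated in a few lines. The sketch below is a simplification under assumed hyperparameters (bonus scale c, decay exponent m) and a generic pseudocount source; it shows only how the source of optimism is kept separate from the network's own, pessimistically initialised outputs.

    import numpy as np

    def optimistic_q(q_values, counts, c=1.0, m=2.0):
        # Augment pessimistically initialised Q-value estimates with a
        # count-derived bonus that vanishes as a state-action pair is visited.
        bonus = c / (counts + 1.0) ** m      # large for novel pairs, ~0 otherwise
        return q_values + bonus

    # Greedy action selection w.r.t. the augmented values makes the never-tried
    # action attractive even though the raw network outputs are near zero.
    q = np.array([0.05, 0.02, 0.00])         # raw network outputs
    n = np.array([120.0, 3.0, 0.0])          # visit (pseudo)counts per action
    print(int(np.argmax(optimistic_q(q, n))))  # -> 2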
Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate its behaviour
while acting in a decentralised fashion. At the same time, it is often possible
to train the agents in a centralised fashion where global state information is
available and communication constraints are lifted. Learning joint
action-values conditioned on extra state information is an attractive way to
exploit centralised learning, but the best strategy for then extracting
decentralised policies is unclear. Our solution is QMIX, a novel value-based
method that can train decentralised policies in a centralised end-to-end
fashion. QMIX employs a mixing network that estimates joint action-values as a
monotonic combination of per-agent values. We structurally enforce that the
joint action-value is monotonic in the per-agent values through the use of
non-negative weights in the mixing network, which guarantees consistency
between the centralised and decentralised policies. To evaluate the performance
of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new
benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a
challenging set of SMAC scenarios and show that it significantly outperforms
existing multi-agent reinforcement learning methods.
Comment: Extended version of the ICML 2018 conference paper (arXiv:1803.11485).
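The monotonicity constraint is enforced architecturally. The following is a minimal PyTorch sketch of such a mixing network, assuming hypernetworks that produce non-negative weights from the global state; the layer sizes and single hidden layer are illustrative choices rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MonotonicMixer(nn.Module):
        # Mixes per-agent values into Q_tot with non-negative mixing weights,
        # so that dQ_tot / dQ_i >= 0 for every agent i.
        def __init__(self, n_agents, state_dim, embed_dim=32):
            super().__init__()
            # Hypernetworks: mixing weights and biases are functions of the state.
            self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
            self.b1 = nn.Linear(state_dim, embed_dim)
            self.w2 = nn.Linear(state_dim, embed_dim)
            self.b2 = nn.Linear(state_dim, 1)

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents), state: (batch, state_dim)
            bs, n = agent_qs.shape
            w1 = torch.abs(self.w1(state)).view(bs, n, -1)   # non-negative
            b1 = self.b1(state).unsqueeze(1)
            hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
            w2 = torch.abs(self.w2(state)).view(bs, -1, 1)   # non-negative
            b2 = self.b2(state).view(bs, 1, 1)
            return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # Q_tot

Because only the weights, not the biases, are constrained, the mixer can still represent rich state-dependent combinations while preserving the consistency between centralised and decentralised policies described above.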
A New Take on Detecting Insider Threats: Exploring the use of Hidden Markov Models
The threat that malicious insiders pose towards organisations is a significant problem. In this paper, we investigate the task of detecting such insiders through a novel method of modelling a user's normal behaviour in order to detect anomalies in that behaviour which may be indicative of an attack. Specifically, we make use of Hidden Markov Models to learn what constitutes normal behaviour, and then use them to detect significant deviations from that behaviour. Our results show that this approach is indeed successful at detecting insider threats, and in particular is able to accurately learn a user's behaviour. These initial tests improve on existing research and may provide a useful approach in addressing this part of the insider-threat challenge.
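A minimal sketch of this likelihood-based flagging idea, assuming hmmlearn's GaussianHMM and hypothetical per-session activity features; the paper's actual feature set, model structure, and decision threshold are not reproduced here.

    import numpy as np
    from hmmlearn import hmm

    # Hypothetical per-session features for one user's normal activity
    # (e.g. logon hour, files accessed, emails sent), scaled to [0, 1].
    normal_sessions = np.random.rand(200, 3)

    # Learn what "normal" looks like for this user.
    model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(normal_sessions)

    # Flag sessions whose log-likelihood falls well below the user's baseline.
    baseline = model.score(normal_sessions) / len(normal_sessions)

    def is_anomalous(session, margin=5.0):
        return model.score(session.reshape(1, -1)) < baseline - margin

    print(is_anomalous(np.array([0.95, 0.95, 0.95])))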
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate their behaviour
while acting in a decentralised way. At the same time, it is often possible to
train the agents in a centralised fashion in a simulated or laboratory setting,
where global state information is available and communication constraints are
lifted. Learning joint action-values conditioned on extra state information is
an attractive way to exploit centralised learning, but the best strategy for
then extracting decentralised policies is unclear. Our solution is QMIX, a
novel value-based method that can train decentralised policies in a centralised
end-to-end fashion. QMIX employs a network that estimates joint action-values
as a complex non-linear combination of per-agent values that condition only on
local observations. We structurally enforce that the joint action-value is
monotonic in the per-agent values, which allows tractable maximisation of the
joint action-value in off-policy learning, and guarantees consistency between
the centralised and decentralised policies. We evaluate QMIX on a challenging
set of StarCraft II micromanagement tasks, and show that QMIX significantly
outperforms existing value-based multi-agent reinforcement learning methods.
Comment: Camera-ready version, International Conference on Machine Learning (ICML) 2018.
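One practical consequence of the monotonicity constraint is that greedy execution decentralises exactly: each agent's own argmax jointly maximises Q_tot. A minimal sketch with hypothetical shapes:

    import torch

    def decentralised_greedy(agent_qs):
        # agent_qs: (n_agents, n_actions) utilities computed from local observations.
        # Under a monotonic mixing of the per-agent values, per-agent argmaxes give
        # the same joint action as maximising Q_tot over the joint action space.
        return agent_qs.argmax(dim=-1)

    agent_qs = torch.tensor([[0.2, 1.3, 0.1],
                             [0.9, 0.4, 0.8]])
    print(decentralised_greedy(agent_qs))  # tensor([1, 0])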
The StarCraft Multi-Agent Challenge
In the last few years, deep multi-agent reinforcement learning (RL) has
become a highly active area of research. A particularly challenging class of
problems in this area is partially observable, cooperative, multi-agent
learning, in which teams of agents must learn to coordinate their behaviour
while conditioning only on their private observations. This is an attractive
research area since such problems are relevant to a large number of real-world
systems and are also more amenable to evaluation than general-sum problems.
Standardised environments such as the ALE and MuJoCo have allowed single-agent
RL to move beyond toy domains, such as grid worlds. However, there is no
comparable benchmark for cooperative multi-agent RL. As a result, most papers
in this field use one-off toy problems, making it difficult to measure real
progress. In this paper, we propose the StarCraft Multi-Agent Challenge (SMAC)
as a benchmark problem to fill this gap. SMAC is based on the popular real-time
strategy game StarCraft II and focuses on micromanagement challenges where each
unit is controlled by an independent agent that must act based on local
observations. We offer a diverse set of challenge maps and recommendations for
best practices in benchmarking and evaluations. We also open-source a deep
multi-agent RL framework including state-of-the-art algorithms. We
believe that SMAC can provide a standard benchmark environment for years to
come. Videos of our best agents for several SMAC scenarios are available at:
https://youtu.be/VZ7zmQ_obZ0
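A minimal random-agent interaction loop, assuming the open-sourced SMAC Python package and its documented StarCraft2Env interface; the map name and the random exploration policy are illustrative.

    import numpy as np
    from smac.env import StarCraft2Env

    env = StarCraft2Env(map_name="8m")       # 8 Marines vs 8 Marines scenario
    n_agents = env.get_env_info()["n_agents"]

    env.reset()
    terminated, episode_return = False, 0.0
    while not terminated:
        obs = env.get_obs()                  # per-agent local observations
        state = env.get_state()              # global state (centralised training only)
        actions = []
        for agent_id in range(n_agents):
            avail = env.get_avail_agent_actions(agent_id)
            actions.append(np.random.choice(np.nonzero(avail)[0]))
        reward, terminated, _ = env.step(actions)
        episode_return += reward
    print("episode return:", episode_return)
    env.close()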
FACMAC: Factored Multi-Agent Centralised Policy Gradients
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic policy gradients to learn policies. However, FACMAC learns a centralised but factored critic, which combines per-agent utilities into the joint action-value function via a non-linear monotonic function, as in QMIX, a popular multi-agent Q-learning algorithm. Unlike QMIX, however, there are no inherent constraints on factoring the critic. We thus also employ a non-monotonic factorisation and empirically demonstrate that its increased representational capacity allows it to solve some tasks that cannot be solved with monolithic or monotonically factored critics. In addition, FACMAC uses a centralised policy gradient estimator that optimises over the entire joint action space, rather than optimising over each agent's action space separately as in MADDPG. This allows for more coordinated policy changes and fully reaps the benefits of a centralised critic. We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks. Empirical results demonstrate FACMAC's superior performance over MADDPG and other baselines on all three domains.
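A minimal PyTorch sketch contrasting the centralised policy gradient described above with per-agent optimisation; the actors and the (factored) critic below are toy stand-ins, not the paper's architecture.

    import torch
    import torch.nn as nn

    def centralised_pg_loss(actors, critic, obs_per_agent, state):
        # Sample every agent's current action and differentiate the joint critic
        # with respect to all policies at once, rather than updating each agent
        # against the other agents' stale actions (as in MADDPG).
        joint_action = torch.cat(
            [actor(obs) for actor, obs in zip(actors, obs_per_agent)], dim=-1)
        return -critic(state, joint_action).mean()

    actors = [nn.Linear(4, 2), nn.Linear(4, 2)]              # toy deterministic policies
    critic = lambda s, a: s.sum(-1, keepdim=True) + a.sum(-1, keepdim=True)
    obs = [torch.rand(8, 4), torch.rand(8, 4)]
    loss = centralised_pg_loss(actors, critic, obs, torch.rand(8, 3))
    loss.backward()                                          # gradients reach every actor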